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ABSTRACT 

PolyChord is a novel nested sampling algorithm tailored for high-dimensional pa¬ 
rameter spaces. This paper coincides with the release of PolyChord vl.3, and pro¬ 
vides an extensive account of the algorithm. PolyChord utilises slice sampling at 
each iteration to sample within the hard likelihood constraint of nested sampling. 
It can identify and evolve separate modes of a posterior semi-independently, and 
is parallelised using OPENMPI. It is capable of exploiting a hierarchy of parame¬ 
ter speeds such as those present in CosmoMC and CAMB, and is now in use in the 
CosmoChord and ModeChord codes. PolyChord is available for download at: 
http://ccpforge.cse.rl.ac.uk/gf/project/polychord/ 

Key -words: methods: data analysis — methods: statistical 


1 INTRODUCTION 

Over the past two decades, the quantity and quality of astro- 
physical and cosmological data has increased substantially. 
In response to this, Bayesian methods have been increasingly 
adopted as the standard inference procedure. 

Bayesian inference consists of parameter estimation and 
model comparison. Parameter estimation is generally per¬ 
formed using Markov-Chain Monte-Carlo (MCMC) meth¬ 
ods, such as the Metropolis-Eastings (ME) algorithm and 
its variants (MacKay 2002). In order to perform model com¬ 
parison, one must calculate the evidence', a high-dimensional 
integration of the likelihood over the prior density (Sivia & 
Skilling 2006). ME methods cannot compute this on a usable 
timescale, hindering the use of Bayesian model comparison 
in cosmology and astroparticle physics. 

A contemporary methodology for computing evidences 
and posteriors simultaneously is provided by nested sam¬ 
pling (Skilling 2006). This has been successfully imple¬ 
mented in the now widely adopted algorithm Multi- 
Nest (Feroz & Eobson 2008; Feroz et al. 2009, 2013). Mod¬ 
ern cosmological likelihoods now involve a large number of 
parameters, with a hierarchy of speeds. MultiNest strug¬ 
gles with high-dimensional parameter spaces, and is unable 
to take advantage of this separation of speeds. PolyChord 
aims to address these issues, providing a means to sam¬ 
ple high-dimensional spaces across a hierarchy of parameter 
speeds. 

The layout of the paper is as follows: Section 2 is a 
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general overview of parameter estimation and model selec¬ 
tion in the context of Bayesian Inference. In Section 3 we 
describe Skilling’s (2006) nested sampling meta-algorithm. 
We overview the historical implementations of nested sam¬ 
pling in Section 4 and provide an account of Neal’s (2000) 
slice sampling technique. We describe the PolyChord algo¬ 
rithm in detail in Section 5 and demonstrate its efficacy on 
toy and cosmological problems in Section 6. Section 7 con¬ 
cludes the paper. In addition we provide three appendices. 
Appendix A describes the procedure for implementing new 
prior distributions within the context of nested sampling. 
Appendices B & C describe the mathematics of inferring 
evidences from the samples produced by nested sampling. 

This paper is an extensive overview of our algo¬ 
rithm, which is now in use in several cosmological applica¬ 
tions (Planck Collaboration XX 2015). A briefer introduc¬ 
tion can be found in Eandley et al. (2015). 

PolyChord is available for download from the link at 
the end of the paper. 


2 BAYESIAN INFERENCE 

In this section, we describe the key concepts of Bayesian 
inference necessary for understanding the utility of Poly¬ 
Chord. For readers experienced in the field, this section 
serves to establish nomenclature and notation. For a full 
discussion of Bayesian inference, we recommend Sivia & 
Skilling (2006) or part IV of MacKay (2002). 


© 2015 RAS 


2 W.J. Handley et. al 


2.1 Nomenclature 

Scientific theory is concerned with the construction of pre¬ 
dictive models in the context of some dataset D. A typical 
model A4 contains a set of variable parameters 8m ■ One 
may use M to calculate the probability of observing the 
data given a specific parameter choice: 

P{V\eM,M)=£. ( 1 ) 


This distribution on V is termed the likelihood C. From a 
Bayesian standpoint a model must also specify our initial 
degree of belief on the parameters 9m '■ 

p{eM\M) = 'K, ( 2 ) 

This distribution on 9m is termed the prior n. Typically 
this is a parametric distribution which quantifies our initial 
assumptions on the scale and spread of the parameters^. 

The likelihood (1) is conditioned on a set of chosen val¬ 
ues for the model parameters 9m ■ One may marginalise out 
the dependence on 9m by integrating over the prior distri¬ 
bution: 

P{V\M)=Z = J P{V\9M,M)P{9M\M)d9M. (3) 

This quantity is termed the evidence Z, or marginalised 
likelihood, and gives the probability of observing the data 
T>, conditioned on the model M. Suppressing explicit de¬ 
pendence on the model, the evidence computation can be 
written as: 

Z = j C(9)-k{ 9) d9. (4) 


2.2 Parameter estimation 


If the prior has been specified, Bayes theorem allows us to 
invert the conditioning in equation (1) and find the posterior 
V by combining the likelihood, prior and evidence: 


P{9m\V,M) 


P{V\9m,M)^{0m\M) 

P{V\M) 


( 5 ) 


which is schematically written as: 


V = 


C X TT 

z 


( 6 ) 


This describes how our initial knowledge rr of the parameters 
updates to V in light of the data T>. Calculation of the poste¬ 
rior 7^(6) is the domain of parameter estimation, and in high 
dimensions is best performed by sampling the space with a 
Markov-Chain Monte-Carlo approach (MCMC). Examples 
include Metropolis-Hastings, Gibbs sampling and Slice sam¬ 
pling. For the most part, the evidence Z is ignored during 
such calculations, and one works with an unnormalised pos¬ 
terior V oc £ X n. 


^ Common examples include a uniform distribution between two 
bounds, or a Gaussian distribution with specified mean and vari¬ 
ance. 



iso-likelihood contours of a two-dimensional multi-modal likeli¬ 
hood function C(9). Each contour encloses some fraction of the 
prior X, indicated by colour. Right: Likelihood T as a function of 
the volume X enclosed by the contour. The evidence is the area 
under this curve. 


2.3 Model comparison 


Of equal importance in scientific investigation is model 
comparison. Typically one has multiple competing models 
{A4i, A42, ■ ■ ■ }, each with their own parameters and as¬ 
sumptions. The data D are able to decide on the relative 
merits of each of these models via Bayes theorem: 


P(Mi\V) 


P(V\Mi)P{Mi) 

P{V) 

TTi 

Ej Zjnj ■ 


( 7 ) 

( 8 ) 


In contrast to parameter estimation, the evidences of each 
model Zi take the leading role in model comparison. 
One typically will choose uniform priors on the models, 
TTi = P(AIi) = const, and then choose to use the model with 
the highest evidence. However, when evidences are similar in 
magnitude, the correct Bayesian approach is to make infer¬ 
ences by marginalising over all models considered. If there is 
a common derived parameter y, with marginalised posterior 
PlylV, A4i) then one may produce the fully marginalised 
posterior: 


P{y\V) 


j:.P{y\V,M,)ZiTVi 

Ej ZjTVj 


( 9 ) 


This fully Bayesian approach has been historically under 
utilised due to the difficulties in computing the evidence 
numerically from the integral (4). 


3 NESTED SAMPLING 

PolyChord falls into a category of sampling algorithms 
known as nested sampling. In order to explain the advances 
that PolyChord has made, it is first necessary to describe 
the nested sampling meta-algorithm. Readers familiar with 
the theory may skip to Section 4. 

Computing the evidence (4) typically involves an inte¬ 
gral over a high-dimensional parameter space, only a small 
fraction of which contributes to Z. The size and position 
of the region surrounding the peak(s) will not be known a 
priori, and in high dimensions is hard to find (see Figure 1). 

Algorithms need to be able quickly to compress the pa¬ 
rameter space from the prior onto the posterior. In order 
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to perform parameter estimation it needs to produce sam¬ 
ples from the posterior, and to perform model comparison 
it should be able to calculate the evidence. Nested sam¬ 
pling (Skilling 2006) offers a means of doing all of these 
tasks simultaneously. 


3.1 Compressing the space 


Nested sampling maintains a population of nnve live points 
within a region of the parameter space. These points are 
sequentially updated so that the region that they occupy 
contracts around the peak(s) of the posterior. 

One begins by sampling nuve points from the prior dis¬ 
tribution 7r(0). At iteration i, the point with the lowest like¬ 
lihood Li is deleted, and then replaced by a new point. The 
new point is drawn from the prior, subject to the constraint 
that its likelihood is greater than Li. 

The fraction of the prior contained within an iso¬ 
likelihood contour T(0) = T is denoted the prior volume: 


x{L) = [ 7r{e)de. ( 10 ) 

Jc(0)>£ 

Since the live points are always drawn uniformly from 
at iteration i the volume containing the live points will con¬ 
tract on average by a factor of niive/(u-uve + !)• Initially the 
prior volume is 1, so at iteration i: 


{X^) = 


f '^live \ 
V^live + 1/ 


e ' 


( 11 ) 


The live points thus compress the prior exponentially. As the 
nested sampling run progresses, one is left with a sequence 
of discarded points (termed dead points). Each dead point 
will have a set of parameter values 6i, a likelihood Li and 
an estimated prior volume Xi. 


3.2 Evidence estimation 

We can use the dead and live points to estimate the evidence. 
By differentiating the prior volume (10), we may re-write the 
evidence calculation (4) as an integral over a single variable: 

[ L{X)dX. (12) 

Jo 

This is detailed graphically in Figure 1. We may thus esti¬ 
mate the evidence by quadrature: 

Zf=i (13) 

i^dead 

where for simplicity we take Wi = Xi-i — Xi. Of course, 
this is only an estimate, since we are inferring the mean 
values (Xi) from the sampling procedure. One may however 
estimate the error in our inference, the full details of which 
can be found in Appendix B. 


3.3 Parameter estimation 


Nested sampling can also perform parameter estimation by 
using the dead and live points as samples from the poste¬ 
rior, provided that the ith point is given the importance 
weighting: 


Pi = 


UJiLi 

Z 


(14) 



logX 


Figure 2. Plot of a generic likelihood as a function of the prior 
volume C(X). In high dimensions, the likelihood is only visible 
if plotted against logX (dashed curve). However, the evidence is 
better visualised by plotting X log(X) (solid curve). The area un¬ 
der the solid curve corresponds to the evidence. The magnitude 
of the solid curve is proportional to the importance weighting. 
Nested sampling proceeds from high to low volumes. After some 
time, the live points no longer contribute significantly to the evi¬ 
dence, and the algorithm terminates at this point. 


where Wi is the prior volume of the shell in which point i 
was sampled. 


3.4 Algorithm termination 

As nested sampling proceeds, the likelihoods Li monoton- 
ically increase, but the weights Wi monotonically decrease. 
This results in a peak in importance weights (14) that can 
be seen in Figure 2. We terminate the algorithm once the re¬ 
maining posterior mass (white region) left in the live points 
is some small fraction of the currently calculated evidence 
(dark region). The posterior mass left in the live points at 
iteration i can be estimated by: 

^live « (15) 

where the average is taken over the live points. Since this is 
typically an underestimate at early times, this will not cause 
premature termination. 


3.5 The unit hypercube 

Each iteration of nested sampling requires one to sample 
from the prior (subject to a hard likelihood constraint). Typ¬ 
ically, priors are defined in terms of simple analytic functions 
such as uniform or Gaussian distributions, and may be sam¬ 
pled using inverse transform sampling. 

In the one-dimensional case, this amounts to converting 
a uniform random variable (which are easy to generate) into 
a variable sampled from a general distribution f{6). One 
first finds its cumulative distribution function (CDF): 

e 

F{0)= I f{e')d9', (16) 

— oo 

computes the inverse of the CDF, and then applies this func¬ 
tion to a uniform random variable x ~ U{0, 1) to generate a 
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variable 0 — F~^{x), which is distributed according to f{0). 
In the general D-dimensional case, one calculates D condi¬ 
tional distributions {Fi : i = 1..., D} by marginalising over 
parameters 6j,j > i, and conditioning on j < i: 


where: 


f f,(9)d0i+i . ..d^N 


, 6»i) = 


J fdO)dei...d0N 


(17) 


(18) 


This generates a set of relations sequentially transforming 
D uniform random variables {xi} into {0i} distributed ac¬ 
cording to f{0). 

In many cases, the prior n{0) is separable, and the above 
equations are easily calculated. For sections of the param¬ 
eters which are not separable, the calculation can become 
more involved. We include a few demonstrations of this pro¬ 
cedure in Appendix A. 

Nested sampling can thus be performed in the unit D- 
dimensional hypercube, x £ [0,1]^, defining a new likeli¬ 
hood function via £-{0) = T(F“^(x)). This has numerous 
advantages, the first being that one only needs to be able to 
generate uniform random variables in [0,1]. The second is 
more subtle; it is more natural to define a distance metric in 
the unit hypercube than in the physical space. Unit hyper¬ 
cube variables all have the same dimensionality: probability. 


4 SAMPLING WITHIN AN ISO-LIKELIHOOD 

CONTOUR 

Now that the nested sampling meta-algorithm has been de¬ 
scribed, we briefly review the various instantiations that ex¬ 
ist, and introduce PolyChord as an algorithm utilising slice 
sampling at each iteration to generate new live points. 

The most challenging aspect of nested sampling is draw¬ 
ing a new point from the prior subject to the hard likelihood 
constraint £ > jCi. This may be done in a variety of ways, 
and distinguishes the various historical implementations. 


4.1 Previous Methods 

For some problems, the iso-likelihood contour is known an¬ 
alytically, allowing one to construct a sampling procedure 
specific to that problem. This is demonstrated by Keeton 
(2011), and can be useful for testing nested sampling’s the¬ 
oretical behaviour. In most cases, however, the likelihood 
contour is unknown a-priori, so a more numerical approach 
must be taken. 

The MultiNest algorithm (Feroz & Hobson 2008; 
Feroz et al. 2009, 2013) samples by using the live points 
to construct a set of intersecting ellipsoids which together 
aim to enclose the likelihood contour, and then samples by 
rejection sampling within the ellipsoids. Whilst being an ex¬ 
cellent algorithm for modest numbers of parameters, any re¬ 
jection sampling algorithm has an exponential scaling with 
dimensionality that eventually emerges. 

An alternative approach (the one initially envisaged by 
Skilling) is to sample with the hard likelihood constraint 
using a Markov-Chain based procedure. One makes several 


steps according to some proposal distribution until one is 
satished an independent sample is produced. This has signif¬ 
icant advantages over a rejection-based approach, the most 
obvious being that the scaling with dimensionality is polyno¬ 
mial rather than exponential. In rejection sampling, points 
are drawn until one is found within the likelihood contour 
(often with extremely low efficiency). Using a Markov-chain 
approach however, (correlated) points are continually gen¬ 
erated within the contour, until one is happy that a sample 
independent from the initial seed has been generated. These 
“intra-chain points” which we term phantom points have the 
potential to provide a great deal more information. 

A traditional Metropolis-Hastings (MH) or Gibbs sam¬ 
pling approach may be utilised, but in general such algo¬ 
rithms are ill-suited to sampling from a hard likelihood con¬ 
straint without a significant amount of tuning of a proposal 
matrix. This is examined in section 6 of Feroz & Hobson 
(2008). 

Galilean (Hamiltonian) sampling (Feroz & Skilling 
2013; Betancourt 2011) improves upon the traditional MH 
sampler by using proposal points generated by reflecting off 
iso-likelihood contours. This however requires gradients to 
be calculated, and can become inefficient if the step size is 
chosen incorrectly, or if the contour has a shape which is 
difficult to ‘step back into’ 

Diffusive nested sampling (Brewer et al. 2009) is an al¬ 
ternative and promising variation on Skilling’s (2006) algo¬ 
rithm, which utilises MGMC to explore a mixture of nested 
probability distributions. Since it is MCMC based, it scales 
well with dimensionality. In addition, it can deal with multi¬ 
modal and degenerate posteriors, unlike traditional MCMC. 
It does however have multiple tuning parameters. 


4.2 Slice sampling 

We have found that a Markov-Chain based procedure utilis¬ 
ing Neal’s (2000) slice sampling at each step is well suited to 
sampling uniformly within an iso-likelihood contour. Rad¬ 
ford Neal initially proposed slice sampling as an effective 
methodology for generating samples numerically from a 
given posterior 'P(0). One first chooses a ‘slice’ (or probabil¬ 
ity level) Fo uniformly within [0,'Pmax]. One then samples 
uniformly within the 0-region defined by F{0) > Vq. The 
similarity with the iso-likelihood contour sampling required 
by nested sampling should be clear. In the one-dimensional 
case, he suggests the sampling procedure detailed in Fig¬ 
ure 3. 

This procedure for sampling within a likelihood bound 
is ideal for nested sampling. It samples uniformly with mini¬ 
mal information: an initial bound size w, and a point xo that 
is within the contour. In general w must be chosen so that 
it is roughly the size of the bound, but if one overestimates 
it then the bounds will contract exponentially. Indeed, one 
may consider this as being equivalent to a prior space com¬ 
pression (11) with niive = Udims = 1. As a starting point, 
one may use one of the live points, which is already uni¬ 
formly sampled. Since the procedure above satisfies detailed 
balance, this will produce a point which is also uniformly 
sampled within the iso-likelihood contour. 

In higher dimensions, Neal (2000) suggests a variety of 
MGMC-like methods. The simplest of these is implemented 
by sampling each of the parameter directions in turn. Since 
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Figure 3. Slice sampling in one dimension. Given a probability 
level (or slice) 'Po, slice sampling samples within the horizontal 
region defined hy P > Pq. Prom an initial point xq within the 
slice {P{xq) > Po), a new point x\ is generated within the slice 
with a distribution P{xi\xo). External bounds are first set on 
the slice L < xo < R hy uniformly expanding a random initial 
bound of width w until they lie outside the slice (Neal terms this 
the stepping out procedure). x\ is then sampled uniformly within 
these bounds. If xi is not in the slice, then L or ij is replaced 
with XI, ensuring that xq is still within the slice. This procedure 
is guaranteed to generate a new point xi, and satisfies detailed 
balance P{xo\xi ) = P{xi |xo). Thus, if xq is drawn from a uniform 
distribution within the slice, so is xi. 

each one-dimensional slice requires ~ O {a few) likelihood 
calculations, the number of likelihood calculations required 
scales linearly with dimensionality. Multi-dimensional slice 
sampling has many of the benefits of a traditional MH ap¬ 
proach, and uses a proposal distribution which is much more 
efficient at sampling a hard likelihood constraint. 

Aitken & Akman (2013) have already applied this pro¬ 
cedure to nested sampling. This works exceptionally well for 
cases in which the parameters are non-degenerate. However, 
this becomes inefficient in the case of correlated parameters, 
or curving degeneracies. 


5 THE PolyChord ALGORITHM 

PolyChord implements several novel features compared 
to Aitken & Akman’s (2013) slice-based nested sampling. It 
utilises slice sampling in a manner that uses the information 
present in the live and phantom points to deal with corre¬ 
lated posteriors. PolyChord also uses a general clustering 
algorithm that identifies and evolves separate modes of the 
posterior semi-independently, and infers local evidence val¬ 
ues. In addition, it has the option of implementing fast-slow 
parameters, which is extremely effective in its combination 
with CosmoMC (Lewis & Bridle 2002). This is termed Cos- 
moChord, which may be downloaded from the link at the 
end of the paper. 

The algorithm is written in FORTRAN95 and paral¬ 
lelised using OPEnMPI. It is optimised for the case where the 
dominant cost is the generation of a new live point. This is 
frequently the case in astrophysical applications, either due 
to high dimensionality, or to costly likelihood evaluation. 

5.1 Multi-dimensional slice sampling 

At each iteration i of nested sampling, we generate a new 
randomly sampled point within the iso-likelihood contour jCi 


by our variant of D-dimensional slice sampling. Slice sam¬ 
pling is performed in the unit hypercube with hypercube 
coordinates denoted in bold (x). 

At each iteration i of the nested sampling algorithm, one 
of the live points is chosen at random as a start point for a 
new chain with hypercube coordinate xq. We then make a 
one-dimensional slice sampling step (Figure 3) with initial 
width in in a random direction no chosen from a probability 
distribution P(n). This generates a new point xi which is 
uniformly sampled in the unit hypercube, but is correlated 
to Xq. This process is repeated rirepeats times, with x^-i 
forming the start point for a slice along fij-i to produce 
Xj . This procedure is illustrated in the right hand half of 
Figure 4. 

Since the probability of drawing Xj from x^-i is the 
same as the probability of drawing Xj_i from Xj , this pro¬ 
cedure satisfies detailed balance. Thus, the resulting chain 
will ergodically be uniformly distributed within the iso¬ 
likelihood contour. This also applies to multi-modal poste¬ 
riors, with the chance of jumping out a mode being equal to 
the chance of jumping back in. 

The length of the chain rirepeats should be large enough 
so that the final point of the chain is decorrelated from the 
start point. This final point may now be considered to be 
a new uniformly sampled point from the prior distribution 
subject to the hard likelihood constraint. The intermedi¬ 
ate points are saved and stored as phantom points. Whilst 
phantom points are correlated, they are useful in providing 
additional information and posterior points. 

There are several elements of this which are left un¬ 
determined, namely the probability distribution P(n), the 
initial width w, and the chain length n-repeats- These issues 
are addressed in the next section. 

5.2 Contour whitening 

In order to determine an optimal P(n) and w, an algorithm 
will need some knowledge of the contour in which the chain 
is progressing. This information can be supplied by the set 
of live and phantom points which are already uniformly dis¬ 
tributed within the contour. We use the sample covariance 
matrix of the live and phantom points as a proxy for the 
size and shape of the contour. 

Uniformly sampled points remain uniformly sampled 
under an affine transformation. The covariance matrix is 
used to construct an affine transformation which “whitens” 
the contour. Sampling is then performed in this whitened 
space, which we term the sampling space. In the sampling 
space, the contour has size ~ 0(1) in every direction. This 
means that one may choose the initial step size as w = 1. 

To transform from x in the unit hypercube to y in the 
sampling space we use the relation: 

L"^x = y, (19) 

where L is the Cholesky decomposition of the covariance 
matrix S = LL^. This is illustrated further in Figure 4. 

Working in the sampling space our choice of P(n) is 
inspired by the default choice of CosmoMC (Lewis 2013). 
Here, a randomly oriented orthonormal basis is chosen, and 
these directions are chosen in a random order. Once a basis 
is exhausted, a new basis is chosen. This approach satisfies 
detailed balance, and mixes rapidly. 
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Figure 4. Slice sampling in D dimensions. We begin by “whitening” the unit hypercube by making a linear transformation which turns 
a degenerate contour into one with dimensions ^ 0(1) in all directions. This is a linear skew transformation defined by the inverse of the 
Cholesky decomposition of the live points’ covariance matrix. We term this whitened space the sampling space. Starting from a randomly 
chosen live point xq, we pick a random direction and perform one-dimensional slice sampling in that direction (Figure 3), using ui = 1 
in the sampling space. This generates a new point xi in ^ C>(a few) likelihood evaluations. This process is repeated ~ Cl(ndims) times 
to generate a new uniformly sampled point xjy which is decorrelated from xq. 


The choice of rirepeats is slightly harder to justify. We 
find that for distributions with roughly convex contours 
nrepeats~ ©(n-dims) is Sufficient, with the constant of propor¬ 
tionality being 2—6. For more complicated contour shapes, 
one may require much larger values of nrcpeats- 

This procedure has the advantage of being dynamically 
adaptive, and requires no tuning parameters. However, this 
“whitening” process is ineffective for pronounced curving 
degeneracies. This will be discussed in detail in Section 6.4. 


5.3 Clustering 

Multi-modal posteriors are a challenging problem for any 
sampling algorithm. “Perfect” nested sampling (i.e. the en¬ 
tire prior volume enclosed by the iso-likelihood contour is 
sampled uniformly) in theory solves multi-modal problems 
as easily as uni-modal ones. In practice however, there are 
two issues. 

First, one is limited by the resolution of the live points. 
If a given mode is not populated by enough five points, it 
runs the risk of “dying out”. Indeed, a mode may be entirely 
missed if the density of live points is too low. In many cases, 
this problem can be alleviated by increasing the number of 
live points. 

Second, and more importantly for PolyChord, the 
sampling procedure may not be appropriate for multi-modal 
problems. We “whiten” the unit hypercube using the co- 
variance matrix of five points. For far-separated modes, the 
covariance matrix will not approximate the dimensions of 
the contours, but instead falsely indicate a high degree of 
correlation. It is therefore essential for our purposes to have 
PolyChord recognise and treat modes appropriately. 

This methodology splits into two distinct parts: (i) 


recognising that clusters are there, and (ii) evolving the clus¬ 
ters semi-independently. 

5.3.1 Cluster recognition 

Any cluster recognition algorithm can be substituted at this 
point. One must take care that this is not run too often, or 
one runs the risk of adding a large overhead to the calcu¬ 
lation. In practice, checking for clustering every ~ Cl(niive) 
iterations is sufficient, since the prior will have only com¬ 
pressed by a factor e. We encourage users of PolyChord 
to experiment with their own preferred cluster recognition, 
in addition to that provided and described below. 

It should be noted that the five points of nested sam¬ 
pling are amenable to most cluster recognition algorithms 
for two reasons. First, all clusters should have the same den¬ 
sity of five points in the unit hypercube. Second, there is no 
noise (i.e. outside of the likelihood contour there will be no 
live points). Many clustering algorithms struggle when ei¬ 
ther of these two conditions is not satisfied. 

We therefore choose a relatively simple variant of the k- 
nearest neighbours algorithm to perform cluster recognition. 
If two points are within one another’s fc-nearest neighbours, 
then these two points belong to the same cluster. We iter¬ 
ate k from 2 upwards until the clustering becomes stable 
(the cluster decomposition does not change from one k to 
the next). If sub-clusters are identified, then this process is 
repeated on the new sub-clusters. 

5.3.2 Cluster evolution 

An important novel feature comes from what one does once 
clusters are identified. 

First, when spawning from an existing live point, the 
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whitening procedure is now defined by the covariance matrix 
of the live points within that cluster. This solves the issue 
detailed above. 

Second, by choosing a random initial live point as a 
seed, PolyChord would naively spawn live points into a 
mode with a probability proportional to the number of live 
points in that mode. In fact, what it should be doing is to 
spawn in proportion to the volume fraction of that mode. 
These should be the same, but the difference between these 
two ratios will exhibit random-walk like behaviour, and can 
lead to biases in evidence calculations, or worse, cluster 
death. Instead, one can keep track of an estimate of the 
volume in each cluster, and choose the mode to spawn into 
in proportion to that estimate. This methodology is docu¬ 
mented in Appendix C. 

In addition to keeping track of local volumes, we may 
keep track of local evidences. At the moment of splitting, the 
existing evidence in the initial cluster is partitioned between 
the new sub-clusters. Upon algorithm completion, one is left 
with an estimate of the proportion of the evidence contained 
within each cluster, and thus a measure of the importance 
of the various modes. By partitioning the local evidences at 
cluster recognition, the local evidences will sum to give the 
total evidences, to within the error on our inference. 

Thus, the point to be killed off is still the global lowest- 
likelihood point, but we control the spawning of the new live 
point into clusters by using our estimates of the volumes 
of each cluster. We call this ‘semi-independent’, because it 
retains global information, whilst still treating the clusters 
as separate entities. 

When spawning within a cluster, we determine the clus¬ 
ter assignment of the new point by which cluster it is nearest 
to. It does not matter if clusters are identified too soon; the 
evidence calculation will remain consistent. 


5.4 Parallelisation 


PolyChord is parallelised by openMPI using a master- 
slave structure. One master process takes the job of organ¬ 
ising all of the live points, whilst the remaining nprocs — 1 
“slave” processes take the job of finding new live points. This 
layout is optimised for the case where the dominant cost is 
the generation of a new live point due to the calculation of 
relatively expensive likelihoods. 

When a new live point is required, the master process 
sends a random live point and the Cholesky decomposition 
to a waiting slave. The slave then, after some work, signals 
to the master that it is ready and returns a new live point 
and the intra-chain points to the master. 

A point generated from an iso-likelihood contour Li 
is usable as a new live point for an iso-likelihood contour 
Lj > Li, providing it is within both contours. One may keep 
slaves continuously active, and discard any points returned 
which are not usable. The probability of discarding a point 
is proportional to the volume ratio of the two contours, so if 
too many slaves are used, then most will be discarded. The 
parallelisation goes as: 


Speedup(nprocs) = niwe log 


1-f 


Nprocs 

^live 


( 20 ) 


and is illustrated in Figure 5. As a rule, PolyChord paral¬ 
lelises almost linearly up to the number of live points, but 



^procs/^live 


Figure 5. Parallelisation of PolyChord. The algorithm paral¬ 
lelises nearly linearly, providing that nprocs < n^ve. For most 
astronomical applications this is more than sufficient. 


from then on exhibits a law of diminishing returns. Since 
the number of live points is typically high ~ 0(500), this is 
more than sufficient for currently available OPEnMPI archi¬ 
tectures, and certainly superior to the parallelisation of the 
standard Metropolis-Hastings algorithm. 


5.5 Posterior bulking 

In addition to lending information on the scale and shape of 
a contour, phantom points can also be used as posterior sam¬ 
ples. Correlations between samples are unimportant for the 
purposes of parameter estimation, providing one has enough 
to be well mixed. We may thus use the importance weight¬ 
ing detailed in (14) with Wi being set to the volume of the 
live-point shell which they occupy. 

For high-dimensional cosmological applications, this re¬ 
sults in a very large number (f^GB) of posterior samples 
being produced, so PolyChord thins these samples. From 
a user’s perspective, one supplies a parameter which deter¬ 
mines the fraction of phantom points to keep. 


5.6 Fast-slow parameters and CosmoChord 

In cosmological applications, likelihoods can exhibit a hi¬ 
erarchy of parameters in terms of calculation speed (Lewis 
2013). Consequently, a likelihood may be quickly recalcu¬ 
lated if one changes only a certain subset of the parameters. 
For PolyChord it is very easy to exploit such a hierarchy. 
Our transformation to the sampling space is laid out so that 
if parameters are ordered from slow to fast, then this hier¬ 
archy is automatically exploited: a Cholesky decomposition, 
being a upper-triangular skew transformation, mixes each 
parameter only with faster parameters. 

From a user’s perspective, PolyChord does this re¬ 
ordering in the hypercube automatically when provided with 
details of the hierarchy. 

Further to this, one may use the fast directions to ex¬ 
tend the chain length by many orders of magnitude. This 
helps to ensure an even mixing of live points. PolyChord 
automatically times likelihood calculation speeds, so the 
user just has to provide what fraction of time PolyChord 
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should be spending on each subset of the parameters, and 
the algorithm will oversample accordingly. 

5.7 Tuning parameters 

From a user’s perspective, the PolyChord algorithm has 
two tuning paramaters: nuve and Urepeats, which are detailed 
below. 

The authors believe that these tuning parameters are 
fairly straightforward to set in comparison to existing algo¬ 
rithms. More importantly, the number of tuning parameters 
does not scale with the dimensionality of the problem. This 
is in contrast to Metropolis-Hastings and Gibbs sampling, 
which require a proposal matrix to be supplied^. 

There are also several other options controlling run time 
behaviour, such as the production of equally weighted pos¬ 
terior samples, whether or not to perform clustering and the 
production and use of files allowing PolyChord to resume 
from a previous run. These are documented in the input files 
supplied with the code. 

Resolution nnve 

This is a generic nested sampling parameter, n-uve indicates 
the number of live points maintained throughout the algo¬ 
rithm. Increasing nuve causes nested sampling to contract 
more slowly in volume (equation 11), and consequently sam¬ 
ple the space more thoroughly. Thus, it can be thought of 
as a resolution parameter. Run time scales ~ ©(nuve) 

If set too low, posterior modes may be missed. Increas¬ 
ing niive increases the accuracy of the inference of Z, since 
the evidence error scales ~ ■ 

Reliability Urepeats 

This is a PolyChord specific parameter. It corresponds to 
the length of the slice sampling chain used to generate a new 
live point. Increasing this parameter decreases the correla¬ 
tion between live points, and hence increases the reliability 
of the evidence inference. Posterior estimations, however, 
remain accurate even in the event of low rirepeats. 

Setting this too low can result in correlation between 
live points, and unreliable evidence estimates. Typically, set¬ 
ting this ~ 0(3 X ridims) is sufficient, but for curving degen¬ 
eracies one may need significantly longer chains. Run time 
scales ~ O(nrepeats). 


6 PolyChord IN ACTION 

We aim to showcase PolyChord as both a high-dimensional 
evidence calculator, and multi-modal posterior sampler. We 
begin by comparing its dimensionality scaling with Multi- 
Nest. We then demonstrate its clustering capabilities in 
high dimensions, and on difficult clustering problems. Poly¬ 
Chord is shown to perform well on moderately pronounced 

^ Proposal matrices may be learnt during run-time. However, this 
learning step can take a prohibitively long time and reduces the 
efficacy of these approaches. 


0.8 


0.6 



-0.8 I-^^^^^^^^-1 

1 2 4 8 16 32 64 128 

Number of dimensions, D 

Figure 6. Evidence estimates and errors produced by Poly¬ 
Chord for a Gaussian likelihood as a function of dimensionality. 
The dashed fine indicates the correct analytic evidence value. 

curving degeneracies, and its implementation in CosmoMC 
is discussed. 

6.1 High-dimensional evidences 

As an example of the strength of PolyChord as a high¬ 
dimensional evidence estimator, we compare it to Multi- 
Nest on a Gaussian likelihood in D dimensions. In both 
cases, convergence is defined as when the posterior mass 
contained in the live points is 10“^ of the total calculated 
evidence. We set nuve = 2511, so that the evidence error re¬ 
mains constant with D. MultiNest was run in its default 
mode with importance nested sampling and expansion factor 
e = 0.1. Whilst constant efficiency mode has the potential 
to reduce the number of MultiNest evaluations, the low 
efficiencies required in order to generate accurate evidences 
negate this effect. 

With these settings, PolyChord produces consistent 
evidence and error estimates with an error ~ 0.4 log units 
(Figure 6). Using importance nested sampling, MultiNest 
produces estimates that are within this accuracy. 

Figure 7 shows the number of likelihood evaluations Nc 
required to achieve convergence as a function of dimension¬ 
ality D. Even on a simple likelihood such as this, Poly¬ 
Chord shows a significant improvement over MultiNest in 
scaling with dimensionality. PolyChord at worst scales as 
Nc^ whereas MultiNest has an exponential scal¬ 

ing which emerges in higher dimensions. However, we must 
point out that a good rejection algorithm like MultiNest 
will always win in low dimensions. 

6.2 Clustering and local evidences 

To demonstrate PolyChord’s clustering capability we re¬ 
port its performance on a “Twin Peaks” and Rastrigin like¬ 
lihood. 

6.2.1 Twin peaks 

PolyChord is capable of clustering posteriors in very high 
dimensions. We define a twin peaks likelihood as an equal 
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Figure 7. Comparing PolyChord with MultiNest using a 
Gaussian likelihood for different dimensionalities. PolyChord 
has at worst whereas MultiNest has an exponen¬ 

tial scaling that emerges at high dimensions. 


Figure 9. PolyChord cluster identification for the Rastrigin 
function. PolyChord identifies posterior modes and computes 
their local evidences, expressed here as a logarithmic fraction of 
the total evidence in the mode. Dashed lines indicate the analytic 
results computed by a saddle point approximation at each of the 
peaks. As can be seen, PolyChord reliably identifies the inner 
21 modes with increasing accuracy. 



Figure 8. The two-dimensional Rastrigin log-likelihood in the 
range [—1.5,1.5]^. Within this region there are 8 local maxima, 
and one global maximum at (0,0). The clustered samples pro¬ 
duced by PolyChord are plotted on the log-likelihood surface, 
with colours that indicating the separate clusters identified. 


mixture of two spherical Gaussians, separated by a distance 
of 10(7. 

PolyChord correctly identifies these clusters in arbi¬ 
trary dimensions (tested up to D = 100), providing that 
niive and rirepeata are scaled in proportion to D. It calculates 
a global evidence that agrees with the analytic results. In 
addition, the local evidences correctly divide the peaks in 
proportion to their evidence contribution. 

The results for a twin peaks likelihood are of an identical 
character to Figures 6 & 7, and hence not included. 


6.2.2 Rastrigin function 

PolyChord’s clustering capacity is very effective on com¬ 
plicated clustering problems as well. The n-dimensional Ras¬ 


trigin test function is defined by: 

n 

f{6) = An + — Acos(27r0i)] , (21) 

A =10, 6li G [-5.12,5.12]. 

This is the industry standard “bunch of grapes”, the two- 
dimensional version of which is illustrated in Figure 8. For 
our purposes, we will treat (21) as the negative log-likelihood 
so that C{9) (X exp[—/(6')]. This is a stereotypically hard 
problem to solve, as many algorithms get stuck in local max¬ 
ima. 

We ran PolyChord on a two-dimensional Rastrigin 
log-likelihood with mive = 1000 and rirepeats = 6. With 
these settings, PolyChord calculates accurate evidence and 
posterior samples (Figure 8), and in addition correctly iso¬ 
lates and computes local evidences for the inner 21 modes. 
Additional outer modes are also found, but these are com¬ 
binations of lower modes due to their very low posterior 
fraction. Increasing the resolution parameter rinve further 
increases the number of modes identified. Examples of clus¬ 
tered posterior samples are indicated in Figure 9, coloured 
using Green’s (2011) ‘cubehelix’. 

6.3 Rosenbrock function 

PolyChord is also capable of navigating moderate curving 
degeneracies. 

The n-dimensional Rosenbrock function is defined by: 

n — 1 

f{x) = ^{a-Xif+b{xi+i-xlf, (22) 

1=1 

a = l, 6=100, a;iG[-5,5], (23) 

the two-dimensional version of which is plotted in Figure 10. 
This is the industry standard “banana”, as it exhibits an ex¬ 
tremely long and flat curving degeneracy. We consider n = 4, 
in which there is a global maximum at (1,1,1, 1) and a local 
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Figure 10. Density plot of the two-dimensional Rosenbrock func¬ 
tion. The function exhibits a long, thin curving degeneracy, with 
a global maximum at (1,1). 



- 0.6 0.0 0.6 1.2 

Xi 


Figure 11. The four-dimensional Rosenbrock posterior, with X 3 
and X 4 marginalised out. PolyChord correctly identifies both 
the local (red) and global (blue) maxima. 

maximum at (—1,1,1,1), PolyChord finds both of these 
(Figure 11) and produces correct evidence estimations. 

In higher dimensions, PolyChord reliably finds the lo¬ 
cal and global maxima. The lack of an analytic evidence 
value for the Rosenbrock function prevents a verification of 
the evidence calculation. 

6.4 Gaussian shells 

A “Gaussian shell” with mean fj., radius r and width w is 
defined as: 

log£sheii(x|^,r, w) = ^ - .11 ^ Sl . . , (24) 

where A is a normalisation constant that may be calculated 
using a saddle point approximation. This likelihood is cen¬ 
tered on some mean vector /r, and has a radial Gaussian 
profile with width w at distance r from this centre. This 
radial profile is then revolved around /r to create a spher¬ 
ical shell-like likelihood. A two-dimensional version of this 
likelihood is indicated in Figure 12. 

This distribution may be representative of likelihoods 
that one may encounter in beyond-the-Standard-Model 


Figure 12. The two-dimensional Gaussian shell likelihood. 

paradigms in particle physics. In such models, the major¬ 
ity of the posterior mass lies in thin sheets or hypersurfaces 
through the parameter space. 

Running PolyChord on a 100-dimensional Gaussian 
shell with nuve = 1000, rirepeats = 200 yields consistent evi¬ 
dences and posteriors, shown in Figure 13. 

Given that this problem is quoted as being “optimally 
difficult” (Feroz et al. 2009), the ease with which Poly¬ 
Chord tackles this problem in high dimensions is worth 
explanation. In the two-dimensional case, it is clear that 
the posterior mass is concentrated in a very thin, curving 
region of the parameter space. However, as the dimensional¬ 
ity is increased, more and more of the n-sphere’s volume is 
concentrated at the edge, and the thin characteristic of the 
degeneracy is lost. 

This may mean that the Gaussian shell is not a good 
proxy for a high-dimensional curving degeneracy. However, 
it could equally suggest that curving degeneracies become 
easier to navigate in higher dimensions. We can certainly 
conclude that a particle physics model with a proliferation 
of phases would be easier to navigate than one with a smaller 
number of phases. 

6.4-i Twin Gaussian shells 

We finish our toy problems by combining the difficulties of 
multimodality (Section 6.2) and degeneracy, by mixing two 
twin Gaussian shells together: 

£(x) oc £sheii(x|^i, r, w) + £sheii(x|/r 2 , r, w). (25) 

We choose r = 2, w = 0.1, and and /i 2 are separated by 
T units. With Ulive ~ lOnuims and Tlrepeats ~ 2?T.clims, POLY- 
Chord successfully computes the local and global posteriors 
and evidences up to D = 100, and reliably identifies the two 
modes. The comparison of run times with MultiNest recov¬ 
ers a similar pattern to Figure 7, although in our experience, 
the MultiNest parameters require some tuning to ensure 
that evidences are calculated correctly when ndims > 30. 

6.5 GosmoChord 

An additional strength of PolyChord lies in its ability to 
exploit a fast-slow hierarchy common in many cosmologi- 
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Figure 14. CosmoChord (red) vs. CosmoMC (black). We use the 2013 CAMSPEC+commander likelihoods with a standard six-parameter 
ACDM cosmology, varying all 14 nuisance parameters(Planck Collaboration et al. 2014). We compare the 1 and 2-dimensional 
marginalised posteriors of the 6 ACDM parameters. CosmoChord is in close agreement with the posteriors produced by CosmoMC, 
recovering the correct mean values of and degeneracies between the parameters. 


cal applications. We have successfully implemented Poly- 
Chord within CosmoMC, and term the result Cosmo¬ 
Chord. The traditional Metropolis-Hastings algorithm is 
replaced with nested sampling. This implementation is avail¬ 
able to download from the link at the end of the paper. 

The exploitation of fast-slow parameters means that 
CosmoChord vastly outperforms MultiNest when run¬ 
ning with modern Planck likelihoods. 

CosmoMC by default uses a Metropolis-Hastings sam¬ 


pler. If this has a well-tuned proposal distribution (e.g. if one 
is performing importance sampling from an already well- 
characterised likelihood), then PolyChord is 2-4 times 
slower than the traditional CosmoMC. If proposal matri¬ 
ces are unavailable (e.g. in the case that one is examining 
an entirely new model) then CosmoChord’s run time is sig- 
nihcantly faster than the native CosmoMC sampler. This is 
a good example of the self-tuning capacity of PolyChord, 
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Figure 13. Posteriors produced by PolyChord for a n = 100- 
dimensional Gaussian shell, with width w = 0.1, radius r = 2, 
and center /i = 0. Plotting the marginalised posteriors for the 
Cartesian sampling parameters {xi, ■ ■ ■ , x„} yields Gaussian dis¬ 
tributions centered on the origin. To see the effectiveness of the 
sampler it is better to plot the sampling parameters in terms of n- 
dimensional spherical polar coordinates Note 

that the polar coordinates are derived parameters, and that the 
sampling space still has the strong Gaussian shell degeneracy. In 
this case we can see that the radial coordinate has a Gaussian 
profile centered on ro = r X ^ ^1 + ^1 + 4(n — l)(in/r)^^ with 

width WQ = ui(l-|-(n— l)('!n/ro)^) The azimuthal coordi¬ 

nate </>„_! has a uniform posterior, and the other angular coor¬ 
dinates {(pi} have posteriors defined by P((pi) oc (sin. 


nested sampling. It can identify and evolve separate modes 
of a posterior semi-independently and is parallelised using 
OpenMPI. 
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DOWNLOAD LINK 

PolyChord is available for download from: http:// 
ccpforge.cse.rl.ac.uk/gf/project/polychord/ 


APPENDIX A: PRIOR TRANSFORMATIONS 

Here we give examples of the procedure for calculating the 
transformation from the unit hypercube to the physical 
space. We demonstrate it for a simple separable case, and a 
more complicated dependent case 

To recap, we aim to compute the inverse of the functions 
Fi-. 


Si 

Fi{9i\9i-i,... ,9 q) = J TTi{9i\9i-i,..., 9\)d9'i, 
0 


where: 


■Ki{9i\9i-i,... ,9o) 


f Tri(9)d9i+i ... d9N 
f TVi{9)d9i .. . d9N 


(Al) 


(A2) 


F maps from 9 in the physical space onto the unit hypercube 
injectively. 


since it only requires two tuning parameters, as opposed to 
~0(D). 

CosmoChord produces parameter estimations consis¬ 
tent with CosmoMC (Figure 14). It has been implemented 
effectively in multiple cosmological applications in the lat¬ 
est Planck paper describing constraints on inflation (Planck 
Collaboration XX 2015), including application to a 37- 
parameter reconstruction problem (4 slow, 19 semi-slow, 
14 fast). In addition, PolyChord is an integral compo¬ 
nent of the ModeChord code, a combination of Cosmo¬ 
Chord and ModeCode (Mortonson et al. 2011; Easther 
& Peiris 2012; Norena et al. 2012), which is available at 
http://modecode.org/. 


7 CONCLUSIONS 

We have introduced PolyChord, a novel nested sampling 
algorithm tailored for high-dimensional parameter spaces. 
It is able to fully exploit a hierarchy of parameter speeds 
such as is found in CosmoMC and CAMB (Lewis & Bridle 
2002; Lewis et al. 2000). It utilises slice sampling at each 
iteration to sample within the hard likelihood constraint of 


Al Separable priors 

A separable prior satisfies: 

Tr)^) = r^7ri(6'i). (A3) 

i 

This has the fortunate side effect that the functions Fi only 
depend on 9i: 

Fi{9i\9,-i,...,9Q)=F^{9^). (A4) 

Solving a separable prior thus amounts to solving a 
one-dimensional inverse-transform sampling problem. We 
demonstrate this procedure for two cases, a rectangular uni¬ 
form prior, and a Gaussian prior. 


A 1.1 Uniform prior 

A rectangular uniform prior is defined by two parameters, 

^min, ^max. 

f n\ J (^max ^min ) for 9 

max <9i <9 

min / A cr \ 

= 1 0 otherwise. 
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Computing F{6) we find: 


n0 

F{e) = / n{e')de', 

J —oo 


6 — 6n 


^max ^min 


(A6) 


with F = 0 or 1 either side of 6min and 0max respectively. 
Inverting the equation F{6) = x we find: 


d — ^min “t“ (^max ^min):r, 


(A7) 


is the transformation from x in the unit hypercube to 9 in 
the physical space. 


A1.2 Gaussian prior 


Defining a Gaussian prior with mean p and standard devi¬ 
ation a: 


Tv{e) = 


exp 


jx - p) 

2cr2 


We find that the procedure above yields: 

9 = 11 + \/2o'erfinv(2a: — 1), 


(A8) 


(A9) 


where erfinv is the conventional inverse error function. 


A2 Forced identifiability priors 


As an example of a prior that is not separable in the param¬ 
eters, we consider a forced identifiability prior. Here, n pa¬ 
rameters are distributed uniformly between 9mm and 6'max, 
but subject to the constraint that they are ordered numeri¬ 
cally. This is a particularly useful prior in the reconstruction 
of functions using a spline with movable knots (Vazquez 
et al. 2012; Aslanyan et al. 2014; Abazajian et al. 2014; 
Planck Collaboration XX 2015). In this case, the horizontal 
locations of the knots must be ordered. 

The required prior is uniform in the hyper-triangle de¬ 
fined by ^min < 9i < ■ ■ ■ < 9n < 0max, and zero everywhere 
else: 

w for 9 min <9t_<-- - <9„ <9 

max 

0 otherwise. 

(AlO) 

To calculate equations (Al & A2) we simply integrate 
over the constant distribution, taking care with the limits. 
We find: 



-Ki{9i\9i-\,. ..,9o) 
Fi{9i\9i-i, ■ ■ ■ ,9 q) 


{n-i + l){9i - di-iY 


(^max ^min) 


9i — 9i-l 


n —i+1 


^max ^i — 1 


n —i + l 


(All) 

(A12) 


where for consistency we define 9q = 9min- Hence solving 
Xi = F{9^\9i-i, ...,9o) for 9i we find: 

9i = 9i.i + {9m.^ - 6ii_i)a;JA—»+i), (A 13 ) 


This enables {9i} to be calculated sequentially from {xi}. 
We may interpret this transformation as 9i being distributed 
as the smallest of n — i -|- 1 uniformly distributed variables 
in the range 6linax]. 


APPENDIX B: EVIDENCE ESTIMATES AND 
ERRORS 

Skilling (2006) initially advocated using Monte-Carlo meth¬ 
ods to estimate the evidence error, although this requires 
the storage of the entire chain of dead points, rather than 
just the subset usually stored for posterior inferences. For 
high-dimensional problems, the number of dead points is 
prohibitively large, and cannot be stored. 

Feroz et al. (2009) use an alternative method based on 
the relative entropy (also suggested by Skilling (2006)). 

Keeton (2011) suggests a more intuitive methodology 
of estimating the error, and it is this which we use, although 
it must be heavily adapted for the case of variable numbers 
of live points and clustering. 


B1 Basic theory 

We wish to compute the sum: 

.E = ^(W_i-W)T.. (Bl) 

i 

However, we do not know the volumes Xi exactly, so we 
can only make inferences about Z, in terms of a probability 
distribution P(Z). In practice, all we need to compute is the 
mean and variance of this distribution: 

mea.n{Z) = Z, (B2) 

var(2)=^-2^ (B3) 

At iteration i, the nuve live points are each uniformly sam¬ 
pled within a contour of volume Xi-\. The volume Xi will 
be the largest volume out of niive uniform volume samples 
in volume Xi. Thus Xi satisfies the recursion relation: 

Xi = tXi-r, Xo = l, (B4) 

P{t) = (B5) 

where the t and Xi_i are independent. 

It is worth noting that the procedure described below 
will generate the mean and variance of the distribution, but 
in fact this is not quite what we want. The evidence is in 
practice approximately log-normally distributed. Thus, it is 
better to report the mean and variance of logZ, defined by: 

mean(log.Z) = 21og:Z- ^log^, (B6) 

var(log^) = log^- 21og2. (B7) 


B2 Computing the mean evidence 

While it is possible to take equations (B1,B4 & B5) and 
compute the mean as a general formula (Keeton 2011), in 
the case of clustering this is uninformative. In fact, for large¬ 
dimensional spaces using the full formula would require stor¬ 
age of a prohibitively large amount of data. The calculation 
is better accomplished by a set of recursion relations, which 
update the mean evidence and its error at each step. 

For now, assume that we have n live points currently 
enclosed by some likelihood contour £ of volume X, and Z 
is the last value of the evidence calculated from all of the 
points that have died so far. By considering (B1,B4&B5), 
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when we kill off the outermost point, we may adjust the 
values of Z and X using: 

Z ^ Z + {l-t)XC, (B8) 

X tX. (B9) 

Taking the mean of these relations, we may use the facts 
that t and X are independent random variables and that 
P{i) = nt"~^, to find the recursion relations: 

Z^Z+—^X£, (BIO) 

n + 1 

X^HL^X. (Bll) 

n + 1 ^ 


B3 Computing the evidence error 

To estimate Z^, we square (B8) and (B9) and multiply both 
together to obtain: 

Z^ ^ Z^ + 2(1 - t)ZX/: + (1 - tfx^c^, (B12) 
ZX ^tzx + t{l-t)X^C, (B13) 

t^X^ (B14) 


Cl Evidence 


We thus need to update the global evidence, the local evi¬ 
dence of cluster p, and the volume of cluster p: 


Z^Z + {1- t)XpC, (Cl) 

Zj,^ Zp + {1- t)XpC, (C2) 

Xp tXp. (C3) 


Since t will be distributed with P{t) = ript"’’ taking the 
mean of these yields: 


np + 1 

Zp^Zp+ 


Up + l' 


Xp 


TXpX p 

rip + I' 


(C4) 

(C5) 

(C6) 


Keeping track of {Z,Zp,Xp,p = 1... m} and updating 
them using the recursion relations in the order above will 
produce a consistent evidence estimate for both the local 
and global evidence errors. 


C2 Evidence errors 


Note that we now need to keep track of the variable ZX, as 
these two are not independent. Taking the averages of the 
above yields: 


2ZXC X2£2 

•72 _v (72 j_I_ 

n-bl (n-f l)(n-b2)’ 
^ n~ZX nlHC 

n -I- 1 (n -I- l)(n -|- 2) ’ 


(B15) 

(B16) 

(BIT) 


B4 The full calculation 

There are therefore five quantities to keep track of: 


We must also keep track of the local and global evidence 
errors. Taking the square of equations (Cl & C2) yields: 


Z'^ ^ Z'^ + 2(1 - t)ZXpC -b (1 - tfxlH, (C7) 

Zl^Zl + 2(1 - t)ZpXpC -b (1 - tfxlH. (C8) 

We can see that we’re going to need to keep track of 
{ZXp, ZpXp, Xp} in addition to }Z'^^Zp}. Taking various 
multiplications of equations (Cl, C2 & C3) finds: 

ZXp tZXp -b (1 - t)tXlL, (C9) 

ZX^^ ZX^^{\-t)XpXqL {p + q), (CIO) 
ZpXp^tZpXp^[\-t)tXlL, (Cll) 

Xl t^Xl, (C12) 

XpX, tXpX^. (C13) 


X, X2. 

These should be initialised at {0, 0, 0,1,1} respectively, and 
updated using equations (B10,B12,B13,B11,B14) in that or¬ 
der. In fact, we keep track of the logarithm of these quanti¬ 
ties, in order to avoid machine precision errors. 


APPENDIX C: EVIDENCE ESTIMATES AND 
ERRORS IN CLUSTERS 

This analysis follows that of Appendix B. We recommend 
that you have understood the methods described there be¬ 
fore continuing. 

Throughout the algorithm, there will in general be m 
identified clusters. In doing so, we wish to keep track of the 
volume of each cluster {Xi,...,Xm}, the global evidence 
and its error Z, Z'^ and the local evidences and their errors 
{Zi,Zi ..., Zm, Z^}. At each iteration, the point with the 
lowest likelihood £ will be killed from cluster p, (1 ^ p ^ m). 


Taking the mean of the above yields the recursion relations: 


^ ^ 1 '2‘ZXp£p 2X^£ 

(C14) 

^ ^ , 2X2£2 

np + l inp + l){np + 2)’ 

(C15) 

^ ^ TlpZXp TlpXpH 

^ ^ rip + l {np + l){np -b 2) ’ 

(C16) 


(C17) 

■c? V <. ^p-^p^p 1 TlpXpC. 

^ ^ Up-b 1 (up-b l)(np-b 2) ’ 

(CIS) 

^ np-b2’ 

(C19) 

V V V '^p^p^q 

+ l ■ 

(C20) 

Keeping track of 



{^2^ 22^ 2Xp, ZpXp, X2, XpX„p, 
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and updating them using the recursion relations in the order 
above will produce a consistent estimate for the local and 
global evidence errors. 


C3 Cluster initialisation 


All that remains is to initialise the clusters correctly at the 
point of creation. 

The starting initialisation of the evidence and volume 
is reasonable, there will be only a single cluster with volume 
1, and all evidence related terms 0. At some point (possibly 
at the beginning, depending on the prior), the live points 
will split into distinct clusters, and the local volumes and 
evidences will need to be re-initialised. 

At the point of splitting a cluster into sub-clusters, 
we partition the n live points into a N new clusters, with 
{«!,..., riiv} live points in each. If the volume of the split¬ 
ting cluster is Xp initially, we need to know how to partition 
this volume into {Xi, ..., A]v}. If the points are drawn uni¬ 
formly from the volume, then the rii will depend on the 
volumes via a multinomial probability distribution: 

P{{ni}\Xp, {Xi}) oc Xi^X .. X"« . (C21) 


We however want to know the probability distributions of 
the {Xi}, given the {rii}. We can invert the above with 
Bayes’ theorem, using an (improper) logarithmic prior on 
the volumes subject to the constraint that they sum to Xp-. 

P{{Xi}\Xp) (X (^22) 

Doing this shows the posterior P{{Xi}\Xp, {rn}) is a Dirich- 
let distribution with parameters {rii}. More importantly, we 
can use this to compute the means and correlations for the 
volumes {Xi}: 


A p, 

n 

(C23) 

ni{ni + 1)^^ 
n{n + l) 

(C24) 

mrij ^ 
n(n-hl) 

(C25) 

—XpY Y£{Z,Zp,X^}. 

n 

(C26) 


The first equation recovers the intuitive result that the vol¬ 
ume should split as the fraction of live points. Note, however 
that this requires a logarithmic prior. The third shows us 
that since XiXj ^ Xi Xj, the volumes are correlated at the 
splitting. This is to be expected. 

We also need to initialise the local evidences and their 
errors. A consistent approach is to assume that the evidences 
also split in proportion to the cluster distribution of live 
points. Following the same reasoning as above, we find that: 


Zji — Zj -p 

n 


ZiXi = 


ni{ni + 

n(n+l) 


„2 _ rii{ni -f 1) 

' n(n-f 1) P 


(C27) 

(C28) 

(C29) 

(C30) 


Thus, at cluster splitting, all of the new local evidences, 


volumes and cross correlations are initialised according to 
the above. 

This completes the mechanism for keeping track of the 
local and global evidences, their errors, and the local cluster 
volumes. 


REFERENCES 

Abazajian, K. N., Aslanyan, G., Easther, R., & Price, L. C. 

2014, ArXiv e-prints, arXiv: 1403.5922 
Aitken, S., & Akman, O. 2013, BMC Systems Biology, 7, 
doi:10.1186/1752-0509-7-72 

Aslanyan, G., Price, L. C., Abazajian, K. N., & Easther, 
R. 2014, ArXiv e-prints, arXiv:1403.5849 
Betancourt, M. 2011, in American Institute of Physics 
Conference Series, Vol. 1305, American Institute of 
Physics Conference Series, ed. A. Mohammad-Djafari, J.- 
F. Bercher, & P. Bessiere, 165-172 
Brewer, B. J., Partay, L. B., & Csanyi, G. 2009, ArXiv 
e-prints, arXiv:0912.2380 

Easther, R., & Peiris, H. V. 2012, Phys.Rev., D85, 103533 
Feroz, F., & Hobson, M. P. 2008, MNRAS, 384, 449 
Feroz, F., Hobson, M. P., & Bridges, M. 2009, MNRAS, 
398, 1601 

Feroz, F., Hobson, M. P., Cameron, E., & Pettitt, A. N. 

2013, ArXiv e-prints, arXiv:1306.2144 

Feroz, F., & Skilling, J. 2013, in American Institute of 
Physics Conference Series, Vol. 1553, American Institute 
of Physics Conference Series, ed. U. von Toussaint, 106- 
113 

Green, D. A. 2011, Bulletin of the Astronomical Society of 
India, 39, 289 

Handley, W. J., Hobson, M. P., & Lasenby, A. N. 2015, 
ArXiv e-prints, arXiv:1502.01856 
Keeton, C. R. 2011, MNRAS, 414, 1418 
Lewis, A. 2013, Phys. Rev. D, 87, 103529 
Lewis, A., & Bridle, S. 2002, Phys. Rev. D, 66, 103511 
Lewis, A., Challinor, A., & Lasenby, A. 2000, Astrophys. 
J., 538, 473 

MacKay, D. J. C. 2002, Information Theory, Inference & 
Learning Algorithms (New York, NY, USA: Cambridge 
University Press) 

Mortonson, M. J., Peiris, H. V., & Easther, R. 2011, 
Phys.Rev., D83, 043505 

Neal, R. M. 2000, ArXiv Physics e-prints, physics/0009028 
Norena, J., Wagner, C., Verde, L., Peiris, H. V., & Easther, 
R. 2012, Phys.Rev., D86, 023505 
Planck Collaboration, Ade, P. A. R., Aghanim, N., et al. 

2014, A&A, 571, A15 

Planck Collaboration XX. 2015, ArXiv e-prints, 
arXiv:1502.02114 

Sivia, D. S., & Skilling, J. 2006, Data analysis : a Bayesian 
tutorial, Oxford science publications (Oxford, New York: 
Oxford University Press) 

Skilling, J. 2006, Bayesian Analysis, 1, 833 
Vazquez, J. A., Bridges, M., Hobson, M. P., & Lasenby, 
A. N. 2012, J. Cosmology Astropart. Phys., 6, 6 


© 2015 RAS, MNRAS 000, 1-15 



